
Attention Is All You Need

Authors: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin

Date: June 12, 2017 (Last revised: August 2, 2023)

Link: https://arxiv.org/abs/1706.03762

[Figure: The Transformer architecture]

Introduction

The paper "Attention Is All You Need" introduced the Transformer, a groundbreaking neural network architecture that revolutionized natural language processing and beyond. This seminal work proposed a new approach to sequence transduction that dispensed with recurrent and convolutional layers entirely, relying solely on attention mechanisms.

Background

Prior to the Transformer, the dominant sequence transduction models were based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. While these models achieved impressive results, they had significant limitations in terms of parallelization and training time.

The Transformer Architecture

The Transformer represents a paradigm shift in how we approach sequence modeling. Here are its key innovations:

Self-Attention Mechanism

The core innovation of the Transformer is its reliance on self-attention mechanisms. Unlike recurrent models that process sequences sequentially, self-attention allows the model to weigh the importance of different positions in the input sequence simultaneously.

The attention function can be described as mapping a query and a set of key-value pairs to an output. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.

Multi-Head Attention

Instead of performing a single attention function, the Transformer uses multi-head attention, which allows the model to jointly attend to information from different representation subspaces at different positions.

import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def split_heads(x, num_heads, depth):
    """Reshape (seq_len, d_model) -> (num_heads, seq_len, depth)."""
    return x.reshape(x.shape[0], num_heads, depth).transpose(1, 0, 2)

def multi_head_attention(query, key, value, num_heads, rng=None):
    """Multi-head attention (simplified sketch: the projection weights are
    random stand-ins for learned parameters, and masking is omitted).

    Args:
        query, key, value: arrays of shape (seq_len, d_model)
        num_heads: number of attention heads (must divide d_model)

    Returns:
        Attention output of shape (seq_len, d_model)
    """
    rng = np.random.default_rng(0) if rng is None else rng
    d_model = query.shape[-1]
    depth = d_model // num_heads

    # Linear projections (learned matrices W_Q, W_K, W_V, W_O in the paper)
    W_q, W_k, W_v, W_o = (
        rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
        for _ in range(4)
    )
    Q, K, V = query @ W_q, key @ W_k, value @ W_v

    # Split into heads: (num_heads, seq_len, depth)
    Q, K, V = (split_heads(t, num_heads, depth) for t in (Q, K, V))

    # Scaled dot-product attention within each head
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(depth)
    attention_output = softmax(scores) @ V

    # Concatenate heads back to (seq_len, d_model) and project the result
    concat = attention_output.transpose(1, 0, 2).reshape(-1, d_model)
    return concat @ W_o

Position-wise Feed-Forward Networks

In addition to attention sub-layers, each layer in the encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically.
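A minimal numpy sketch of this sub-layer, assuming the paper's ReLU formulation FFN(x) = max(0, xW1 + b1)W2 + b2 (the weights and toy dimensions here are illustrative stand-ins for learned parameters):

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x @ W1 + b1) @ W2 + b2, applied identically
    to every position (row) of x."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

# Toy dimensions (the paper uses d_model=512, d_ff=2048)
d_model, d_ff, seq_len = 8, 32, 5
rng = np.random.default_rng(0)
x = rng.standard_normal((seq_len, d_model))
W1, b1 = rng.standard_normal((d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d_model)), np.zeros(d_model)

out = position_wise_ffn(x, W1, b1, W2, b2)
print(out.shape)  # (5, 8): same shape as the input
```

Because the same two matrices are applied at every position, the sub-layer acts like two 1x1 convolutions over the sequence.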

Positional Encoding

Since the Transformer doesn't use recurrence or convolution, positional encodings are added to the input embeddings to inject information about the relative or absolute position of tokens in the sequence.
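The paper's sinusoidal variant can be sketched directly from its two defining formulas (a numpy illustration; learned positional embeddings are an alternative the paper also tested):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal encodings: PE(pos, 2i)   = sin(pos / 10000^(2i/d_model)),
                             PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))."""
    positions = np.arange(seq_len)[:, None]          # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]             # (1, d_model/2)
    angles = positions / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe

pe = positional_encoding(seq_len=50, d_model=16)
print(pe.shape)    # (50, 16)
print(pe[0, :4])   # position 0: [0., 1., 0., 1.]
```

Each dimension corresponds to a sinusoid of a different wavelength, which lets the model attend by relative position.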

Key Results

The experimental results demonstrated the superiority of the Transformer architecture:

Machine Translation Performance

  • WMT 2014 English-to-German: Achieved 28.4 BLEU score, improving over existing best results by over 2 BLEU points
  • WMT 2014 English-to-French: Established a new single-model state-of-the-art BLEU score of 41.8 after training for just 3.5 days on eight GPUs

Training Efficiency

One of the most remarkable aspects of the Transformer is its training efficiency. The model required only a small fraction of the training costs of the best models from the literature at the time, while achieving superior results.

Generalization

The paper demonstrated that the Transformer generalizes well to other tasks beyond machine translation, successfully applying it to English constituency parsing with both large and limited training data.

Model Architecture Details

The Transformer follows an encoder-decoder structure:

  • Encoder: Composed of a stack of 6 identical layers, each with two sub-layers (multi-head self-attention and position-wise feed-forward network)
  • Decoder: Also composed of 6 identical layers, but with an additional sub-layer that performs multi-head attention over the encoder's output

Both the encoder and decoder use residual connections around each sub-layer, followed by layer normalization.

Impact and Legacy

The "Attention Is All You Need" paper has had a profound impact on the field of deep learning:

  1. Foundation for Modern NLP: The Transformer architecture became the foundation for models like BERT, GPT, T5, and countless others
  2. Beyond NLP: The architecture has been successfully adapted for computer vision (Vision Transformers), audio processing, and multi-modal learning
  3. Scalability: The parallelizable nature of Transformers enabled the training of increasingly large models
  4. Research Direction: Shifted the research community's focus from recurrent architectures to attention-based models

Technical Innovations

Scaled Dot-Product Attention

The attention mechanism uses a scaling factor to prevent the dot products from becoming too large in magnitude:

Attention(Q, K, V) = softmax(QK^T / √d_k)V

Where d_k is the dimension of the key vectors.
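In code, the formula above is only a few lines (a numpy sketch; batching, masking, and dropout are omitted):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # (n_q, n_k)
    # Numerically stable softmax over the key axis
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                                    # (n_q, d_v)

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))   # 4 queries
K = rng.standard_normal((6, 8))   # 6 keys
V = rng.standard_normal((6, 8))   # 6 values
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)
```

Without the 1/sqrt(d_k) scaling, large dot products would push the softmax into regions with vanishingly small gradients, which is exactly the failure mode the scaling factor prevents.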

Layer Normalization and Residual Connections

Each sub-layer employs a residual connection followed by layer normalization:

LayerNorm(x + Sublayer(x))

This design choice helps with gradient flow and enables training of deeper networks.
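The sub-layer wrapper can be sketched as follows (a simplified version without the learned gain and bias parameters of full layer normalization):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each position's features to zero mean and unit variance
    (learned gain/bias omitted in this sketch)."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def residual_block(x, sublayer):
    """LayerNorm(x + Sublayer(x)): the post-norm form used in the paper."""
    return layer_norm(x + sublayer(x))

x = np.random.default_rng(0).standard_normal((5, 8))
out = residual_block(x, lambda h: 2 * h)  # toy sublayer
print(out.mean(axis=-1))  # ~0 for every position
```

The residual path gives gradients a direct route through the stack, while the normalization keeps activations in a stable range at every depth.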

Conclusion

"Attention Is All You Need" introduced a revolutionary architecture that transformed the landscape of deep learning. By demonstrating that attention mechanisms alone could achieve state-of-the-art results without recurrence or convolutions, the Transformer opened new possibilities for model design and set the stage for the modern era of large language models and foundation models.

The paper's impact extends far beyond its original application to machine translation, influencing research across multiple domains and becoming one of the most cited papers in modern AI research.


Citation:

@article{vaswani2017attention,
  title={Attention is all you need},
  author={Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and Kaiser, Lukasz and Polosukhin, Illia},
  journal={Advances in neural information processing systems},
  volume={30},
  year={2017}
}